64 research outputs found

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

Author: Pelloni Olga
Samardžić Tanja
Shaitarova Anastassia
Publication date: 11/12/2022

Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve the performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice for a transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but it has been revealed recently that this is often not the best choice. The success of cross-lingual transfer seems to depend on some properties of languages, which are currently hard to explain. Successful transfer often happens between unrelated languages and it often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models over the languages that score low on our measure results in the lowest average perplexity over target low-resource languages. Our correlation coefficients obtained with three different pre-trained multilingual models are consistently higher than all the other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choice (genealogical and typological proximity).

ZORA

The Scope and the Sources of Variation in Verbal Predicates in English and French

Author: Kashaeva Goljihan
Merlo Paola
Samardžić Tanja
van der Plas Lonneke
Publication date: 01/12/2010

Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 199-210. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

DSpace at Tartu University Library

Digitising Swiss German : how to process and study a polycentric spoken language

Author: Glaser Elvira
Samardžić Tanja
Scherrer Yves
Publication date: 29/11/2019

Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular. Peer reviewed

Crossref

ZORA

Helsingin yliopiston digitaalinen arkisto

ArchiMob : A multidialectal corpus of Swiss German spontaneous speech

Author: Glaser Elvira
Samardžić Tanja
Scherrer Yves
Publication date: 01/01/2019

Alemannische Dialektologie – Forschungsstand und Perspektiven. Sonderheft. Peer reviewed

Directory of Open Access Journals

ZORA

Helsingin yliopiston digitaalinen arkisto

BOP Serials

Automatic interlinear glossing as two-level sequence classification

Author: Samardžić Tanja
Schikowski Robert
Stoll Sabine
Publication venue: s.n.
Publication date: 01/01/2010

We discuss the aspect of synchronisation in the language design and implementation of the asynchronous data flow language S-Net. Synchronisation is a crucial aspect of any coordination approach. S-Net provides a particularly simple construct, the synchrocell. As a primitive S-Net language construct synchrocell implements a one-off synchronisation of two data items of different type on a stream of such data items. We believe this semantics captures the essence of synchronisation, and no simpler design is possible. While the exact built-in behaviour as such is typically not what is required by S-Net application programmers, we show that in conjunction with other language features S-Net synchrocells meet typical demands for synchronisation in streaming networks quite well. Moreover, we argue that their simplistic design, in fact, is a necessary prerequisite to implement an even more interesting scenario: modelling state in streaming networks of stateless components. We finish with the outline of an efficient implementation by the S-Net runtime system

Crossref

ZORA

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Jezična akomodacija na Twitteru: Primjer Srbije

Author: Maja Miličević Petrović
Nikola Ljubešić
Tanja Samardžić
Publication venue: Slavistično društvo Slovenije
Publication date: 01/03/2019

In this paper we investigate the phenomenon of linguistic accommodation among Serbian Twitter users by analysing geocoded messages published between 2013 and 2016 in Bosnia and Herzegovina, Montenegro, Croatia and Serbia. We describe the language production of Twitter users by means of 16 variables known to vary among speakers of the polycentric macrolanguage BCHS (Bosnian, Montenegrin, Croatian, Serbian). We compare the language production of mobile Serbian Twitter users with that of non-mobile Serbian users, as well as the production of mobile users inside and outside Serbia. While the first analysis partially supports accommodation theory, the second analysis gives no indication of the phenomenon.

Directory of Open Access Journals

Part-of-Speech Tag Disambiguation by Cross-Linguistic Majority Vote

Author: Aepli Noëmi
Samardžić Tanja
von Waldenfels Ruprecht
Publication venue: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages
Publication date: 23/08/2014

ZORA

Composition of lipid extract of wheat, corn and sunflower harvest residues

Author: Lužaić Tanja
Maksimović Zoran
Romanić Ranko
Samardžić Stevan
Publication date: 01/01/2022

As the world's population keeps growing, so does the global demand for food, which leads to an increase in arable land under cereals and oilseeds. The quantity of harvest residues, which are most often burned, is growing as well. Burning harvest residues poses a major environmental risk, on the one hand, because it is a frequent cause of fires, while on the other hand the residues are valuable biomass that remains unused. In the last few years, a trend of burning residues in the field has been observed, which leads to air pollution and endangers public health. Harvest residues contain various components that could find application in the food and pharmaceutical industries. Analysis of the composition of the lipid extract of harvest residues established the presence of biologically valuable components that could be used in the production of meat products with improved oxidative stability, better shelf life and improved fatty acid composition, as well as in new formulations of natural cosmetics. Paper in a distinguished national journal (M52)

FarFar - Repository of the Faculty of Pharmacy, University of Belgrade